Trying out the new DataFrame package in .NET Core

Yesterday, Prashanth Govindarajan posted an article about DataFrame on the .NET Blog. I got excited and wanted to try the library as soon as I could. In this post, I will explain what the library is intended for and what are my thoughts on it.

The package

The DataFrame related classes were introduced in the package Microsoft.Data.Analysis. The source code is available in the corefxlab repository, Microsoft.Data.Analysis.

This means that the package is experimental and is currently in preview.

Installing

As with all NuGet packages, you can install it either by using the Visual Studio NuGet Package Manager or using the CLI: dotnet add package Microsoft.Data.Analysis.
Note: You can also install the package in a .NET Jupyter Notebook.

Functionalities

Loading data from CSV (not mature enough).
Easy column/row selection.
Join/Merge DataFrames.
GroupBy.
Handle Null values.
And others…

Example

Creating Data

In this example, we’ll be creating data instead of loading it from a CSV, since DataFrame.LoadCsv doesn’t handle null values and all structs (e.g. DateTime) yet.

string[] names = { "John", "Ahmed", "Chris", "Albert" };
int[] salaries = { 20000, 30000, 40000, 10000 };
DateTime[] birthdays = { DateTime.Parse("23/4/1990"), DateTime.Parse("4/5/1982"),
    DateTime.Parse("2/1/1980"), DateTime.Parse("9/10/1994") };
string[] departments = { "Development", "Development", "HR", null };

var idColumn1 = new PrimitiveDataFrameColumn<int>("Id", new int[] { 1, 2, 3, 4 });
var nameColumn = new StringDataFrameColumn("Name", names);
var birthdayColumn = new PrimitiveDataFrameColumn<DateTime>("Birthday", birthdays);
var salaryColumn = new PrimitiveDataFrameColumn<int>("Salary", salaries);

var idColumn2 = new PrimitiveDataFrameColumn<int>("EmployeeId", new int[] { 3, 1, 2, 4 });
var departmentColumn = new StringDataFrameColumn("Department", departments);

var employeesDf = new DataFrame(idColumn1, nameColumn, birthdayColumn, salaryColumn);
var departmentsDf = new DataFrame(idColumn2, departmentColumn);

The example data is very simple, we have employees (id, name, birthday and salary) and which department they belong to.

Note: This code would change to 2 lines when LoadCsv works properly.

DataFrame information and description

employeesDf.PrettyPrint();

Id	Name	Birthday	Salary
1	John	23/04/1990	20000
2	Ahmed	04/05/1982	30000
3	Chris	02/01/1980	40000
4	Albert	09/10/1994	10000

departmentsDf.PrettyPrint();

EmployeeId	Department
3	Development
1	Development
2	HR
4	null

employeesDf.Info().PrettyPrint();

Info	Id	Name	Birthday	Salary
DataType	System.Int32	System.String	System.DateTime	System.Int32
Length (excluding null values)	4	4	4	4

departmentsDf.Info().PrettyPrint();

Info	EmployeeId	Department
DataType	System.Int32	System.String
Length (excluding null values)	4	3

I would’ve hoped to see more information in Info, for example the memory usage like in pandas.

employeesDf.Description().PrettyPrint();

Description	Id	Birthday	Salary
Length (excluding null values)	4	4	4
Max	4	null	40000
Min	1	null	10000
Mean	2.5	null	25000

PrettyPrint

DataFrame.ToString only takes into account the maximum length of columns when printing the dataframe, which doesn’t always work since some column values might be longer.
I created a function that does the opposite to get some good looking tables, the code is available at the end of the post.

Fill nulls with a value

departmentColumn.FillNulls("Other", inPlace: true);

In the future, it would be nice to have more ways to fill nulls, for example the most present value (in this case Development).

Creating a column from an existing one by applying a function

var currentYear = DateTime.Now.Year;
employeesDf["Age"] = birthdayColumn.Apply(d => currentYear - d.Year);
employeesDf.PrettyPrint();

For now, DataFrame.Apply only handle same-type function, meaning that if your colum is of type DateTime, the output needs to be the same, which doesn’t work for our example (and for other use cases).

I opened an issue in the repo (here) and I’m working on an PR, for now, I’m using a custom extension method that will be available at the end of the post too.

Id	Name	Birthday	Salary	Age
1	John	23/04/1990 00:00:00	20000	29
2	Ahmed	04/05/1982 00:00:00	30000	37
3	Chris	02/01/1980 00:00:00	40000	39
4	Albert	09/10/1994 00:00:00	10000	25

Normalize a column

Note: This example intends to show how the columns can handle operations.

var minSalary = (float)(int)salaryColumn.Min();
var maxSalary = (int)salaryColumn.Max();
employeesDf["NormalizedSalary"] = (salaryColumn - minSalary) / (maxSalary - minSalary);
employeesDf.PrettyPrint();

Id	Name	Birthday	Salary	Age	NormalizedSalary
1	John	23/04/1990 00:00:00	20000	29	0.3333333333333333
2	Ahmed	04/05/1982 00:00:00	30000	37	0.6666666666666666
3	Chris	02/01/1980 00:00:00	40000	39	1
4	Albert	09/10/1994 00:00:00	10000	25	0

Join 2 DataFrames

var df = employeesDf.Merge<int>(departmentsDf, "Id", "EmployeeId", joinAlgorithm: JoinAlgorithm.Inner);
df.PrettyPrint();

Id	Name	Birthday	Salary	Age	NormalizedSalary	EmployeeId	Department
3	Chris	02/01/1980 00:00:00	40000	39	1	3	Development
1	John	23/04/1990 00:00:00	20000	29	0.3333333333333333	1	Development
2	Ahmed	04/05/1982 00:00:00	30000	37	0.6666666666666666	2	HR
4	Albert	09/10/1994 00:00:00	10000	25	0	4	Other

I’ll have to admit that I got confused at first, thinking that Join should do this while Merge would only merge without a condition.

Drop a column and sort by a column

df.Columns.Remove("EmployeeId");
df = df.Sort("Id");
df.PrettyPrint();

Id	Name	Birthday	Salary	Age	NormalizedSalary	Department
1	John	23/04/1990 00:00:00	20000	29	0.3333333333333333	Development
2	Ahmed	04/05/1982 00:00:00	30000	37	0.6666666666666666	HR
3	Chris	02/01/1980 00:00:00	40000	39	1	Development
4	Albert	09/10/1994 00:00:00	10000	25	0	Other

Mean salary by department

var employeesByDepartment = df.GroupBy("Department");
employeesByDepartment.Mean("Salary").PrettyPrint();

Department	Salary
Development	30000
HR	30000
Other	10000

Value counts of department

departmentColumn.ValueCounts().PrettyPrint();

Values	Counts
Development	2
HR	1
Other	1

Sample rows

df.Sample(2).PrettyPrint();

Note: This method can give duplicate rows. I created an issue and hopefully it’ll be fixed next preview.

Conclusion

I’m excited to see what this library becomes in the future. It still needs a lot of work but I hope it gets mature enough to get out of the lab and be officially supported, it can be a nice addition to .NET.

If you’re interested in the code used in this post:

See you soon!